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Biclustering, the process of simultaneously clustering observa- 
tions and variables, is a popular and effective tool for finding structure 
in a high-dimensional dataset. A variety of biclustering algorithms ex- 
f*—) ist, and they have been applied successfully to data sources ranging 

from gene expression arrays to review-website data. Currently, while 
biclustering appears to work well in practice, there have been no 
theoretical guarantees about its performance. We address this short- 
coming with a theorem providing sufficient conditions for asymptotic 
consistency when both the number of observations and the number 
of variables in the dataset tend to infinity. This theorem applies to a 
broad range of data distributions, including Gaussian, Poisson, and 
i i Bernoulli. We demonstrate our results through a simulation study 

[J_] and with examples drawn from microarray analysis and collaborative 

filtering. 

1. Introduction. Suppose we are given a data matrix X = [Xij], and 
!— 1 our goal is to cluster the rows and columns of X into meaningful groups. For 

! example, X^ could be the log activation level of gene j in patient i; our goal is 

J> to seek groups of patients with similar genetic profiles, while at the same time 

finding groups of genes with similar activation levels. Alternatively, Xij can 
indicate whether or not user i reviewed movie j; our goal is to simultaneously 
cluster the users and the movies. Mirkin (1996) termed the general clustering 
\Q process "biclustering," but it is also known as direct clustering (Hartigan, 

1972), block modeling (Arabie, Boorman and Levitt, 1978), and co-clustering 
(Dhillon, 2001). 

Empirical results from a broad range of disciplines indicate that biclus- 
tering data is useful in practice. For example, Ungar and Foster (1998) and 
^ Hofmann (1999) found that biclustering helps identify structure in latent 

class models for collaborative filtering problems where the data is sparse 
and diverse tastes make it difficult to cluster purely based on purchasing or 
review habits. 

Several biclustering applications exist in the biological sciences. Eisen 
et al. (1998) was one of the first papers to note the benefits of clustering genes 
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and conditions in microarray data, finding that genes with similar functions 
cluster together. Recently, Harpaz et al. (2010) applied biclustering methods 
to a Food and Drug Adminstration report database, identifying associations 
between certain active ingredients and adverse medical reactions. Several 
other applications of biclustering exist; see Cheng and Church (2000), Getz, 
Levine and Domany (2000), Lazzeroni and Owen (2002), and Kluger et al. 
(2003) as well as Madeira and Oliveira (2004) for a comprehensive survey. 

Although their goals are the same, the references above use a variety of 
different biclustering algorithms. Clearly, many of these algorithms work well 
in practice, but they are often ad-hoc, and there are no rigorous guarantees 
as to their performance. In particular, the lack of any notion of consistency 
means that practitioners cannot be assured that their discoveries from bi- 
clustering will generalize or be reproducible; collecting more data may lead 
to completely different biclusters. 

Our first objective in this report is to establish a probabilistic model 
for the data matrix, so that biclustering can be formalized as an estimation 
problem. Once we have done this, we study a class of biclustering algorithms 
based on profile- likelihood, as proposed by Ungar and Foster (1998). We 
show that these profile-based procedures are asymptotically consistent as 
the dimensions of the matrix tend to infinity, under weak assumptions on 
the elements of X . Notably, our methods can handle both dense and sparse 
data matrices. To our knowledge, this is the first general consistency result 
for a biclustering algorithm. 

Our work was inspired by recent developments in clustering methods for 
undirected networks. In that context, X is a symmetric binary matrix, and 
the clusters for the rows of X are the same as the clusters for the columns 
of X. Bickel and Chen (2009) proposed using Holland, Laskey and Lein- 
hardt's (1983) stochastic block model for the data matrix. They studied 
a general class of clustering algorithms and derived sufficient conditions for 
these algorithms to be consistent. Choi, Wolfe and Airoldi (2012) generalized 
this result by allowing the number of blocks to increase with the network 
size, Zhao, Levina and Zhu (2011a) allowed for nodes that do not belong 
to any community, and Zhao, Levina and Zhu (2011b) relaxed the assump- 
tion of stochastic equivalence within each block by incorporating individual 
effects. Rohe, Chatterjee and Yu (2011) also studied the stochastic block 
model with an increasing number of blocks and established a bound on the 
number of misclassified nodes for spectral clustering. 

In the setting of biclustering for directed networks Rohe and Yu (2012) 
considered a stochastic block model and proposed the "dice 'em" bicluster- 
ing algorithm (DI-SIM), which "dices" the rows and columns of a matrix 
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into clusters using its left and right singular vectors. They derived consis- 
tency results for this algorithm but required the expected degree to grow 
sufficiently fast. Our theoretical results do not require this restriction and 
are not limited to directed networks. 

The remainder of the paper is organized as follows. Section 2 describes 
the block model and defines the profile-likelihood. Section 3 contains the 
main theoretical results and Section 4 presents the main steps of the proof. 
Section 5 corroborates the theoretical findings through the use of asymptotic 
simulations and the results suggest that the profile-likelihood outperforms 
other biclustering methods. Section 6 looks at applications to real data and 
the results from biclustering experiments done on a movie-review and a 
microarray dataset are reported. Section 7 presents some concluding remarks 
and the appendix includes additional technical and empirical results. 

2. Block models and profile likelihood. As above, let X = [Xij] G 
W nxn be a data matrix. One way to formalize the biclustering problem is 
to posit the existence of K row classes and L column classes, such that 
the mean value of entry Xij is determined solely by the classes of row i 
and column j. That is, there is an unknown row class membership vector 
c G K m , an unknown column class membership vector d G L n , and an 
unknown mean matrix M = [fj, k i] G R KxL such that 

E X^ = fic-dj ; 

we refer to this model as a block model, after the related model for undirected 
networks proposed by Holland, Laskey and Leinhardt (1983). With a block 
model, our goal is to estimate c and d. We do this by assigning labels to the 
rows and columns of X, codified in vectors g G K m and h G L n . Ideally, g 
and h match c and d. Note that we are assuming that the true numbers of 
row and column clusters, K and L, are fixed and known. 

We can employ profile-likelihood (Murphy and van der Vaart, 2000) for 
the biclustering task. Initially, we propose a simple distribution for X, and 
we derive the maximum profile likelihood estimator for c and d under this 
model. Later, we consider a far more general distribution for X , and we 
show that the simple profile- likelihood based estimator is still consistent, 
even under model misspecification. 

Suppose that the elements of X are independent draws whose distribution 
is specified by a single-paremeter exponential family. Conditional on c and 
d, entry Xij has density gix;^^) with respect to some cr-finite measure v, 
where 

g(x;r]) = ex.p{xr) - i/)(rf)}; 
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ip(r]) is the cumulant function, and r/ki = (i/j r ) 1 (pki)- With labels g and h, 
the complete data log-likelihood is 

l(g,h,M) = mn^2p k qi{X kl <q k i - ip(r]kl)}, 

k,l 

where 

Pk = — y^I(ft = k), qi = = 0, 

* i 

and 

= Eij x u % = fc > = 
Eij 1 ^* = h i = 

The profile log-likelihood is 

pl(g,/i) = supl(g,h,M) = miiV ]pkqiip*{X k i), 

M Tt 

where ij)*(x) = sup v {xr] — ip(rj)} is the convex conjugate of ip, known as the 
rate function in the large deviations literature (Dembo and Zeitouni, 1998). 
Following the above derivation, it is natural to estimate c and d by the label 
vectors which maximize pl(g, h). 

In the sequel, we consider a far more general setting. We consider criterion 
functions of the form 

(2-1) F(g,h) = mn^2p k qi f(X k i/p), 

kl 

where / is any smooth convex function and p is a scale parameter. Borrowing 
terminology from the large deviations literature, we refer to / as the rate 
function of the criterion; though, since we allow / to take negative values, we 
do not require that / be a rate function in the strictest sense. We permit the 
elements of X to have different distributions, allowing for heteroscedasticity 
and model misspecification. We show that under mild technical conditions, 
the maximizer of F is a consistent estimator of the true row and column 
classes. 

3. Consistency results. To prove consistency, we need to work with 
a sequence X n of data matrices. Suppose that X n G K mx ™ anc l m = m(n) 
with n/m — > 7 for some finite constant 7. Suppose that for each n there 
exists a row class membership vector c n G K rn and a column class mem- 
bership vector d n G L n . We ctssume that th.6 components of c n are assigned 
independently by repeatedly drawing from a multinomial distribution with 
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class probability vector p G M. K ; similarly, the components of d n are assigned 
independently by repeatedly drawing from a multinomial distribution with 
class probability vector q G M. L . When there is no ambiguity, we omit the 
subscript n. 

Assume that the mean of element Xij depends only on the row and column 
memberships q and dj, so that 

E(Xij | c, d) = Hadj 

for some matrix M = [pki] £ M KxL , possibly varying with n. To model 
sparsity in X, we allow M to tend to 0. To avoid degeneracy, we suppose 
that there exists a sequence p and a fixed matrix Mq G M. KxL such that 
p~ x M -> Mq. 

Define the normalized confusion matrices C(g) G M. KxK and D(h) G 

R LxL by 

C a k(9) = — y~)l(<H = a,gi = k), D u (h) = - V ]l(dj = b,hj = I). 
* j 

Entry C a t is the proportion of nodes with class a and label k; entry 

is defined similarly. These matrices are normalized so that C T 1 = p and 

D T 1 = q. 

We only consider nontrivial partitions; to this end, for e > 0, define 

Je = {g,h : p k {g) > e, qi{g) > e}. 

For fixed convex rate function /, we let F be a criterion function as 
in (2.1). 

3.1. Assumptions. Denote by Aio the convex hull of the entries of Mq. 
Let M be a neighborhood of Aio- We require the following assumptions: 

(Al) The biclusters are identifiable. No two rows of Mq are equal, and 

no two columns of Mq are equal. 
(A2) The rate function is locally strictly convex. That is, f"(p) > 

when p G M.. 

(A3) The third derivative of the rate function is locally bounded. 

That is, \f"'(p)\ is bounded when p G M.. 
(A4) The average variance of the elements is of the same order as 

?• 

limsup V] E[(Xy - p Cldl ) 2 | c, d] < oo. 

rwoo pmn 3 
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(A5) The mean matrix does not converge to zero too quickly. That 

is, limsup n _ >00 pn = oo. 
(A6) The elements satisfy a Lindeberg condition. For all e > 0, 

A variant Lyaponuv's condition (Varadhan, 2001) implies (A6). That is, 

if 

lim r-^, 2+5 Y\ E \X tJ - » Cidj \ 2+S = 
i,0 

for some 8 > 0, then (A6) follows. In particular, for dense data (p bounded 
away from zero), uniformly bounded (2 + 5) absolute central moments for 
any 5 > is sufficient. For sparse data (Bernoulli or Poisson data with p 
converging to zero), (A5) is a sufficient condition for (A6). 

3.2. Main result. Given the above setup and assumptions, it follows that 
the maximizer of F(g, h) is a consistent estimator of the true row and column 
labels. 

Theorem 3.1. Fix any e > with e < min a {p a } and e < mm^qi,} . 
Let (g, h) satisfy F(g, h) = maxj- e F(g, h). If assumptions (A1)-(A6) hold, 
then all limit points of C(g) and D(h) are permutations of diagonal matri- 
ces, i.e. the proportions of mislabeled rows and columns converge to zero in 
probability. 

The statement of the theorem is somewhat awkward, because we can 
permute the row or column labels and get the same value of the criterion 
function, F. Thus, F has multiple maximizers, and it is only possibly to 
recover the true classes up to a permutation of labels. 

From the discussion in Section 2, it follows that maximizing the profile log- 
likelihood associated with a single-parameter exponential family may be a 
reasonable biclustering procedure. Furthermore, the proof of this result does 
not require the distribution of the data to be correctly specified and allows 
for possible model misspecification so long as (Al)-(A6) are satisfied. This 
implies that this result can be applied to binary matrices, count data, and 
continuous data which could be reasonably modeled by Binomial, Poisson, 
and Gaussian distributed data, respectively, making this a suitable result 
for our motivating examples. 

To prove the theorem, we first establish that in the limit, F is close to a 
nonrandom "population version," G. Then, we establish that G is maximized 
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at the true class labels. Finally, we show that outside any neighborhood 
around the true class labels, G is smaller than at the true values. Section 4 
provides the details. 

4. Proof of consistency theorem. In this section, we continue with 
the setup and notation of Section 3. To prove Theorem 3.1, we need to 
introduce some additional notation. 

Define expectation matrix E(g,h) G R KxL with 

E kl (g, h) = E(X kl (g, h) | c, d) - [C '' M D]k ' 



[C^IUD'^ 

where C = C(g) and D = D(h). Also, define normalized residual matrix 
R(g,h) G R KxL by 

R(g,h) = p- l {X{g,h) - E(g,h)}. 

The weak law of large numbers establishes that for fixed g and h, the con- 
vergence R k i(g,h) — >• holds. We can prove a stronger result, that this 
convergence is uniform over all g and h. 

Lemma 4.1. Under assumptions (A1)-(A6), for all e > 0, 

supH-R^/OHoo 4 0, 

where || Alloc = m & x k,i \Aki\ for any matrix A. 

With Lemma 4.1 (proved in Appendix A), we can establish that in the 
limit, F(g,h) is close to its "population version," which depends only on 
C and D. To define this population version, first, for each M, define Sm 
to be the set ofMxM matrices with nonnegative entries summing to one: 
S M = {X £ Rl IxM : 1 T X1 = 1}. Next, define function G Mo ■ S K xS L -»■ R 
to be the population version of F: 



G Mo (C,D) = J2[C T l]k[D T l] l f 



k.i 



[C T M D] kl 
[C T l] k [D T ll 



Lemma 4.2. F is close to its population version in the sense that, for 
all e > 0, 

S up\F(g,h)-G Mo (C,D)\^0. 
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This is a direct consequence of Lemma 4.1 and assumption (A3), which 
implies that / is locally Lipschitz; see Appendix A for details. 

Once we have established that F is close to its population version, our 
next task is to show that the population version is maximized at the true 
class labels. 

Lemma 4.3. For 5 > define 

V = {C G S K ■ m&xCakCa'k < 5} 

and 

Q = {DeS L : vaaxDuDvi < 5}. 
b^b' 

//min a {[Cl] a } > n, mm b {[Dl] b } > n, and(C,D) <£TxQ, thenG Mo (C,D) 
is small, in the sense that 

G Mo (C,D) - J2[Cl] a [Dl] b f([M ] ab ) < - K rj 2 8, 

a.b 

where k is a constant independent of 5 and rj. 

We prove Lemma 4.3 in Appendix A. Next, we prove the consistency theo- 
rem. 

Proof of Theorem 3.1. Fix 5 > and define V and Q as in Lemma 4.3. 
We will show that if (g,h) G J e and if (C(g),D(h)) $ (V,Q), then 
F(g, h) < F(c, d) with probability approaching one. Moreover, this inequal- 
ity holds uniformly over all such choices of (g,h). Since 5 is arbitrary, this 
implies that C(g) and D(h) converge to permutations of diagonal matrices, 
i.e. the proportions of misclassified rows and columns converge to zero. 

Set r n = sup Je \F(g,h)-GM (C(g),D(h)\. Suppose (g,h) G J £ . In this 
case, 

F(g, h) - F(c, d) < 2r n + {G Mo (C(g), D(h)) - G Mo (C(c), D(d))} 

= 2r n + {G Mo (C(g),D(h)) - £[Cl] a [Dl] 6 /([M ]a6)}. 

a.b 

Pick rj > smaller than mm a {p a } and min^qi,}. By assumption, the true 
row and column classes follow multinomial distributions with probabilities p 
and q. Thus, for all g G K m and h G L n , by the weak law of large numbers, 
[C{g)] a > 7/ and [D(h)] b > rj with probability tending to one as n increases; 
moreover, this holds uniformly over all choices of (g,h). 
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Applying Lemma 4.3, to the second term in the inequality, we get that 
with probability approaching one, 

F(g,h)-F(c,d) < 2r n - KV 2 6 

for all (g, h) G J £ such that (C{g),D{h)) (£ V x Q. By Lemma 4.2, r n -4 0. 
Thus, with probability approaching one, (C(g), D(h)) G V x Q. Since this 
result holds for all 5, all limit points of C(g) and D(h) must be permutations 
of diagonal matrices. □ 

5. Empirical evaluation. We study the performance of profile log- 
likelihood for biclustering Bernoulli, Poisson, and Gaussian data. For these 
three cases, the rate functions are as follow: 

(5.1a) /Bernoulli (A*) = filog fJl + (1 - fji) log(l - (i), 

(5.1b) /Poisson(^) = fJ,log fJ, - fJL, 

(5.1c) /Gaussian(/W) = A* 2 / 2 - 

We use the appropriate rate function depending on the context. 

For the Poisson and Gaussian rate functions (5.1b) and (5.1c), the maxi- 
mizer of the criterion function (2.1) does not depend on the scale factor p. 
For the Binomial rate function (5.1a), we use p = 1 in the fitting procedure, 
regardless of the true scale factor for the mean matrix M. Even though in 
the simulations assumption (Al) doesn't hold for this choice of p, we still get 
consistency, because the maximizer of the criterion with /Bernoulli ( A 4 ) is close 
to the maximizer with /p isson(A*)- See Perry and Wolfe (2012) for discussion 
of a related phenomenon. 

To maximize the profile log-likelihood, we compute initial partitions for 
the rows and columns by applying /c-means separately to the rows and the 
columns. Then, we iteratively update the cluster assignments in a greedy 
manner, based on the Kernighan-Lin heuristic (Kernighan and Lin, 1970) 
employed by Newman (2006): 

1. For each row and column, we compute the optimal label assignment 
while keeping the labels of all other rows and columns fixed; we also 
record the improvement made by making this assignment. 

2. In order of the local improvements recorded in step 1, we perform the 
label reassignments determined in step 1. Note that these assignments 
are no longer locally optimal since the labels of many of the rows and 
columns change during this step. 

3. Out of those labels considered in step 2, we choose the one which has 
the highest profile likelihood. 
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We iteratively perform steps 1-3 until the profile likelihood converges. The 
algorithm is not guaranteed to converge to a global optimum, but it seems 
to perform well in practice. 

In our simulations, we report the proportion of misclassified rows and 
columns by the profile-likelihood-based method (PL), which Theorem 3.1 
guarantees to be consistent. We also report misclassification errors for k- 
means applied separately to the rows and columns (KM) and for the DI-SIM 
biclustering algorithm (DS) of Rohe and Yu (2012). 

For the Poisson simulation, we simulate from a block model with K = 2 
row clusters and L = 3 column clusters. We vary the number of columns, 
n, between 200 to 1500 and we take the number of rows as m = jn where 
7 G {0.5, 1,2}. 

We set the row and column class membership probabilities as p = (0.3, 0.7) 
and q = (0.2, 0.3, 0.5). We choose the matrix of block parameters to be 

r , b /0.92 0.77 1.66\ 
M = [f " ab] = ^ \0.17 1.41 1.45,)' 

where b is chosen between 5 and 20; the entries of the matrix were chosen 
randomly, uniformly on the interval [0,2]. Note that M — > 0, so the data 
matrix is sparse. We generate the data conditional on the row and column 
classes as Xij \ c, d ~ Poisson(// c .^ j ). 

Figure 1 presents the average bicluster misclassification rates for each sam- 
ple size and Table 1 reports the standard deviations. In all of the scenarios 
considered, the biclustering based on the profile log-likelihood criterion per- 
forms at least as well as the other methods and shows signs of convergence. 
Although the /c-means algorithm and the DI-SIM algorithm perform well 
in some settings, in other settings the performance is poor and does not 
show signs of convergence. We also see that the standard deviation of the 
misclassification rate tends to zero for the PL bicustering, but not for the 
other two algorithms. 

Appendix B describes in detail the simulations for Bernoulli and Gaussian 
data. For the Bernoulli simulation, we see similar behavior to the Poisson 
simulation reported here. For the Gaussian data, the DI-SIM algorithm is 
much more competitive, but our algorithm still beats it in almost all cases. 

Overall, the simulation results corroborate the conclusions of Theorem 3.1 
and support the use of biclustering based on the profile log-likelihood crite- 
rion. 
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Fig 1. Average misclassification rates for Poisson example over 100 simulations. 
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Standard deviations for Poisson example over 100 simulations. 
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6. Applications. In this section we use profile-likelihood-based biclus- 
tering to reveal structure in two high-dimensional datasets. For each exam- 
ple, we maximize the profile log-likelihood using the algorithm described in 
Section 5. 

6.1. Movielens. Collaborative filtering uses the behavior of similar con- 
sumers to provide better recommendations of products to new individuals. 
Since consumer habits likely vary depending on products, biclustering can 
help identify structure in the data and identify groups of consumers and 
groups of products with similar patterns. As an application of this we apply 
biclustering to the MovieLens data set (GroupLens Research Project, 1998). 

The MovieLens data set consists of 100,000 movie reviews on 1682 movies 
by 943 users. Each user has rated at least 20 movies and each movie is rated 
on a scale from one to five. In addition to the review rating, the release date 
and genre of each movie is available as well as some demographic information 
about each user including gender, age, occupation and zip code. 

In order to retain customers, movie-renting services strive to recommend 
new movies to individuals who are likely to view them. Since most users only 
review movies that they have already seen, we can use the structure of the 
user-movie review matrix to identify associations between users and viewing 
habits of movies. Specifically, we consider the 943x1682 binary matrix X 
where X%j = 1 if user i has rated movie j and Xy = otherwise. To find 
structure in X, we biclustered the rows and columns of X using the profile 
log-likelihood based on the Bernoulli criterion (5.1a). 

To determine a reasonable selection for the number of biclusters we varied 
the number of user groups from two to three and the number of movie groups 
from two to seven. Qualitatively, the model with three user groups and six 
movie groups seems to provide a parsimonious description of the data. 

Figure 2 presents the heatmap of the data based on the resulting biclus- 
ter assignments, with the ordering of the clusters determined by the total 
number of a reviews in each cluster. Table 2 reports the top ten movies in 
each group. The eclectic mix of genres within each movie group suggests 
that the rating behavior of users is not explained by genre alone. Figure 6.1 
presents a boxplot comparing the distributions of the movie release years 
for each group. We can see a clear ordering of the movie groups by median 
release date. 

The median ages within the user group were 32, 31, and 30.5, and the 
percentages of male users within each group were 67.6%, 72.8%, and 77.8%. 
These statistics suggest that there is some age and gender effect on the 
reviewing habits of the users. 
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Roughly speaking, user group 3 is consistently active across all movie 
groups with increasing activity as the popularity of the movie increases. 
The reviewing habits of user group 2 follow a similar pattern but to a lesser 
extent. In contrast, user group 1 is consistently inactive with the only ex- 
ceptions being movie groups 4 and 6. From Figure 6.1, it appears that the 
users in group 1 only rate recent movies whereas users in groups 2 and 3 
rate movies from all time periods. 

The biclusters discovered here suggest that a movie-renting service should 
recommend under-reviewed movies to individuals in user group 3, and it 
should recommend new releases to users in group 1. This information can 
be used in an ensemble-based recommendation engine like that of Toscher, 
Jahrer and Bell (2009). 

Fig 2. Heatmap generated from MovieLens data reflecting the varying review patterns in 
the different biclusters. Blue identfies movies with no review and white identifies rated 
movies. 
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Table 2 

The top ten movies in each cluster based on the total number of reviews. 



Group 1 


Group 2 


Jade (1995) 
When the Cats Away (1996) 
Jaws 3-D (1983) 
Bastard Out of Carolina (1996) 
Exit to Eden (1994) 
The Ruling Class (1972) 
The Air Up There (1994) 
Bad Taste (1987) 
Stuart Saves His Family (1995) 
Cabin Boy (1994) 


She's the One (1996) 
Jack (1996) 
The Preacher's Wife (1996) 
Striptease (1996) 
Mirror Has Two Faces, The (1996) 
Hercules (1997) 
Kids in the Hall: Brain Candy (1996) 
Jean dc Florette (1986) 
The Fan (1996) 
Extreme Measures (1996) 




Group 3 


Group 4 


The Firm (1993) 
The Abyss (1989) 
Die Hard: With a Vengeance (1995) 
Remains of the Day, The (1993) 
Sneakers (1992) 
The Professional (1994) 
Clerks (1994) 
Reservoir Dogs (1992) 
Like Water For Chocolate (1992) 
Chinatown (1974) 


Courage Under Fire (1996) 
Volcano (1997) 
Murder at 1600 (1997) 
Mars Attacks! (1996) 
The People vs. Larry Flynt (1996) 
Starship Troopers (1997) 
Eraser (1996) 
Das Boot (1981) 
Good Will Hunting (1997) 
The Fifth Element (1997) 




Group 5 


Group 6 


Raiders of the Lost Ark (1981) 

Pulp Fiction (1994) 
The Silence of the Lambs (1991) 
The Empire Strikes Back (1980) 
Back to the Future (1985) 
The Fugitive (1993) 
Indiana Jones and the Last Crusade (1989) 
The Princess Bride (1987) 
Forrest Gump (1994) 
Monty Python and the Holy Grail (1974) 


Star Wars (1977) 
Contact (1997) 
Fargo (1996) 
Return of the Jedi (1983) 
Liar Liar (1997) 
The English Patient (1996) 
Scream (1996) 
Toy Story (1995) 
Air Force One (1997) 
Independence Day (1996) 



Fig 3. Boxplot comparing the different clusters based on the number of movies reviewed 
by each individual. 
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6.2. AGEMAP. Biclustering is commonly used for microarray data to 
visualize the activation patterns of thousands of genes simultaneously. It 
is used to identify patterns and discover distinguishing properties between 
genes and individuals. We use the AGEMAP dataset (Zahn et al., 2007) to 
illustrate this process. 

AGEMAP is a large microarray data set containing the log expression 
levels for 40 mice across 8932 genes measured on 16 different tissue types. 
For this analysis, we restrict attention to two tissue types: cerebellum and 
cerebrum. The 40 mice in the dataset belong to four age groups, with five 
males and five females in each group. One of the mice is missing data for 
the cerebrum tissue so it has been removed from the dataset. 

To study the relationship between mice and gene expression levels, we 
bicluster the 39 x 17864 residual matrix computed from the least squares 
fit to the multiple linear regression model 

Yij = (3 j + PijAi + f3 2j Si + Sij, 

where Yij is the log- activation of gene j in mouse i, A\ is the age of mouse i, 
Sj indicates if mouse i is male, Sij IS cl random error, and (/Soj, /3\j, fyj) is a 
gene-specific coefficient vector. 

The entries of the residual matrix are not independent (for example, the 
sum of each column is zero). Also, the responses of many genes are likely 
correlated with each other. Thus, the block model required by Theorem 3.1 
is not in force, so its conclusion will not hold unless the dependence between 
the residuals is negligible. In light of this caveat, the example should be 
considered as exploratory data analysis. 

We perform biclusterring using profile log-likelihood based on the Gaus- 
sian criterion (5.1c). Based on the preliminary analysis in Perry and Owen 
(2010), it appears that there are three mice groups. To determine an appro- 
priate number of gene groups we experiment with values between two and 
five. Using four gene groups appears to give a reasonable representation of 
the data. The heatmap presented in Figure 6.2 shows the results. 

Although the expression levels for the fourth gene group appear to be 
fairly neutral across the three mouse groups, the first three gene groups 
each have a visually apparent pattern. It appears that a mouse can have 
high expression levels at most two of the first three gene groups. Mouse 
group 1 has high expression for gene groups 2 and 3; mouse group 2 has 
high expression for gene group 1; and mouse group 3 has high expression for 
gene groups 1 and 3. 

The three clusters of mice agree with those found by Perry and Owen 
(2010). That analysis identified the mouse clusters, but could not attribute 
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meaning to them. The bicluster-based analysis has deepend our understand- 
ing of the three mouse clusters while suggesting some interesting interactions 
between the genes. 



Fig 4. Heatmap generated from AGEMAP residual data reflecting the varying expression 
patterns in the different biclusters. The colors for the matrix entries correspond encode to 
the first quartile, the middle two quartiles, and the upper quartile. 
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7. Discussion. In this report we developed a statistical setting for 
studying the performance of biclustering algorithms. Under the assump- 
tion that the data follows a stochastic block model, we derived sufficient 
conditions for an algorithm based on maximizing the profile-likelihood to 
be consistent. This is the first theoretical guarantee for any biclustering al- 
gorithm which can be applied to a broad range of data distributions and 
can handle both sparse and dense data matrices. Our empirical comparisons 
demonstrated that the method performs well in a variety of situations and 
can outperform existing procedures. 

It is important to note that the theoretical results operate under the as- 
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sumption that the true number of row and column classes are known. In 
practice, this is a simplifying assumption and the optimal number of biclus- 
ters needs to be inferred from the data. While this is an open problem in 
the biclustering literature, the formalization of biclustering as an estimation 
process may help identify parallels that can be drawn from classical model 
selection procedures to this setting and is an interesting topic for future 
research. 

Applying the profile-likelihood based biclustering algorithm to real data 
revealed several interesting findings. Our results from the MovieLens dataset 
identified a distinct difference in movie review behavior and split individuals 
who only rate recent movies from individuals who rate movies from all time 
periods. Biclustering the genes and mice in the AGEMAP data exposed an 
interesting pattern in the expression of certain genes and we found that at 
most two gene groups can have high expression levels for any one mouse. 
The consistency theorem proved in this report gives conditions under which 
we can have confidence in the robustness of these findings. 

APPENDIX A: ADDITIONAL TECHNICAL RESULTS 

Lemma A.l. For each n, let X n ^ m , 1 < m < n, be independent random 
variables with EA„ jm = 0. Let p n be a sequence of positive numbers. LetX n 
be a subset of the powerset 2^ n \ with inf{|7| : I £ I n } > L n . Suppose 

(i) ^— Ylm=i E| A„ im | 2 is uniformly bounded in n; 



Proof. Let e > be arbitrary. Define Y n ,m = P n l X n ,m^(\X n , m \ < 
ty/npn), and note that 



(ii) For all e > 0, 



ELi E(|A„ jm | 2 ; \X n , m \ > e^hpn) -»• 0; 




Then 




n 



Pr(y„, m / 



p n x X n<m for some 1 < m < n) < ^ Pr(|X n , m | > E\fnp n ) 



m=l 



< 




n,m 



2. 



X ntm \ > tVnpn) 



-> 0. 
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Fix an arbitrary t > 0. Set ^ n>m = EY n)m and for / G X n define 



m&I 



For I £ I n , write 



Prf ^y n , m >t\I\j = Pr (^(Y„, m - Hn,m) > \I\(t-Hn(I) 



Note that since EX nm = 0, it follows that 



1 



^(Pn, 1 Xn.m'i \X nm \ > E\/np n )\ < = — E(\X nm \ 2 ; |X nm | > Ey/np r , 

£^np n 



\pn,rn\ ~ 

Thus, by (ii) and (iii) we have that sup /g j n {|// ri (/)|} — > 0; in particular, for 
n large enough, sup/ g i ri {|/x n (/)|} < §. Consequently, for n large enough, 
uniformly for all /, 

Pr ( Y, Y n,m > 1 < Pr ( ^(y„, m - p n , m ) > 1 1/|/2) . 

m&I m&I 

Similarly, 

Pr (Z] y ™' m < -*l J l) < Pr (5Z(^,m-^n, m ) < -*|/|/2). 



m&I 



m&I 



We apply Bernstein's inequality to the bound. Define a 2 m = E(Y njm — 
/i„, m ) 2 and u n (I) = Y^ m ei a l,m- Note tnat l^n.m - /Vm| - 2ey^- By Bern- 
stein's inequality, 



Pr ^ ^ ^ (^fi.,m Pn,m j 
m&I 



< 2 eX P 1 / r N id ^/o \ 

L \ vJl)+et\I\Jn /3J 



By (i), (iv), and (v), it follows that for n large enough, f n (I) < £t\I\y/n/?>, 
so 

' } 



t\I\ 



Pr(| (^n.m-Mn.m)! > W*) < 2 exp 

We use this bound for each I to got the union bound: 

Pr ( sup l-i V] F nm > i) < 2|X n | exp |_ = 2exp { log|X ra |- |. 

By (iii) and (iv) , it is possible to choose e such that the right hand side goes 
to zero. □ 
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Proof of Lemma 4.1. For all t > 0, 
Pr (supH-Rfo./OHoo > tj 



<KLPr( sup p- x \ i X H ~ ^dj) 



IlEZn 



> till 



where X n C 2^1 x 2^ is the set of all biclusters such that pk > £ for all k and 
qi > e for all I. Since X n is a subset of the powerset 2^ nm \ by Lemma A.l, it 
follows that 

PT(sap\\R(g,h)\\ 00 >t\ -+0. 



□ 



Proof of Lemma 4.2. The technical assumptions of / imply that its 
first derivative is bounded. Therefore, / is locally Lipschitz continuous with 
Lipschitz constant c = sup f'((i) for fj, in a neighborhood of M. and 

\F(p- l X(g, h),p(g), q{h)) - G Mo (C, D)\ < c\\R(g, h)^. 

From Lemma 4.1, the righthand side converges to zero almost surely and 
the result follows. □ 

Lemma A. 2 (Refined Jensen's Inequality). Let f : R — > R &e tarace 
differentiable and let N be a convex set in R. If x±, . . . ,x n are points in N , 
and ifwi,... ,w n are nonnegative numbers summing to one, then 

n \ n 

y^Wif[xi) - f(z) > - inf /"(y) Vwi(xj - z) 2 , 
i=i i=\ 

where z = Ya=i w i x i- 

Proof. Define k = infygjv" /"(?/) an d use the bound 



/(^)>/(z) + /'(z)(x i -z) + ^(x i -z) 2 . 



□ 



Lemma A.3. Let C = [C ak ] G If^ 4 *^ and Z> = [D u ] € M BxL 5e non- 
negative matrices whose entries sum to one. Let M = [fi a b] £ M^ xB 6e a 
matrix with no two identical columns. Define Ai to be the convex hull of the 
entries of M and let f : R — > R 6e a fioice differentiable convex function 
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with f" bounded away from zero in Ai. Suppose that [Cl] a > rj for all a and 
that DbiDyi > e for some I and b ^ b' . There exists a positive constant C 
depending on M and f such that 

" L ' C'MDi, 



k=l 1=1 



[C T l] k [D T l] 



A B ~„2, 



eye 

a=l 6=1 



<^^[Cl] a [Dl],/M 



Proof. Let I, and b / b' be such that D b iD b n > e. Since no two columns 
of M are identical, there exists an a such that /j, ab ^ /J- a b'- Let k be the index 
of the largest element in row a of matrix C; this element must be at least 
as large as the mean, i.e. 

ak - K ~ K 

Let W = [C T l]k[D T this is nonzero. Now, there exists € M. such 
that 

[C T MD] k i = C ak D h in ab + C ak D b >ifj, ab / + (W - C ak D bi - C ak D b n)(j,*. 

Set k = mt^M f"(ji) and define N = [u ab ] eR AxB with v ah = /(/i a6 ). By 
Lemma A. 2, 

[C T l] k [D T l} lf ( fJ M 3 k ' ) ~ \C T ND] kl < - K -{, ab - » ab ,) ADhlDm 



c j i] fc [u j i],; - J/ "- 2— — W 

This inequality only holds for one particular choice of k and I; for other 
choices, the left hand side is nonpositive by Jensen's inequality. The result 
of the theorem follows, with C defined by 



C = o min (/x a6 -/i ab /) 2 . 

Z a,b^b' 



□ 



Proof of Lemma 4.3. If D ^ Q, then for some I and some 6 / 6', 
DbiD b 'i > <5. By Lemma A. 3, 
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where kq depends only on Mq and /. Similarly, if C ^ V, then the right 
hand side is bounded by —kq t] 2 5/L 2 . The result of the lemma follows with 
K = k / mm{K 2 ,L 2 }. □ 

APPENDIX B: ADDITIONAL EMPIRICAL RESULTS 

This appendix reports additional empirical results for Bernoulli and Gaus- 
sian distributed data. Figures 5-6 present the average bicluster misclassifica- 
tion rates for each sample size and Tables 3-4 report the standard deviations 
for the Bernoulli and Gaussian simulations, respectively. Since the normal- 
ization for the DI-SIM algorithm is only specified for non-negative data, the 
algorithm is run on the un-normalized matrix. 

For the Bernoulli simulation, we simulate from a block model with K = 2 
row clusters and L = 3 column clusters. We vary the number of columns, 
n, between 200 to 1500 and we take the number of rows as m = jn where 
7 G {0.5,1,2}. 

We set the row and column class membership probabilities as p = (0.3, 0.7) 
and q = (0.2, 0.3, 0.5). We choose the matrix of block parameters to be 

_ b_ /0.43 0.06 0.13\ 

~ V - 10 °- 34 °- 17 / ' 

where the entries were selected to be on the same scale as Bickel and Chen 
(2009). We vary b between 5 and 20. We generate the data conditional on 
the row and column classes as Xij \ c, d ~ Bernoulli(// Ci rf j . ). 
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Fig 5. Average misclassification rates for Bernoulli example over 100 simulations. 
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Table 3 

Standard deviations for Bernoulli example over 100 simulations. 
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For the Gaussian simulation, we simulate from a block model with K = 2 
row clusters and L = 3 column clusters. We vary the number of columns, 
n, between 50 to 400 and we take the number of rows as m = where 
7 €{0.5, 1,2}. 

We set the row and column class membership probabilities as p = (0.3, 0.7) 
and q = (0.2, 0.3, 0.5). We choose the matrix of block parameters to be 

M -h( 0,47 0-15 ~ 0m \ 
V-0.26 0.82 0.80y 

where the entries were simulated from a uniform distribution on [—1,1]. We 
vary b between 0.5 and 2. We generate the data conditional on the row and 
column classes as Xij \ c, d ~ Gaussian(/x Ci d j , a = 1). 



Fig 6. Average mis classification rates for Gaussian example over 500 simulations. 
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Table 4 

Standard deviations for Gaussian example over 500 simulations. 
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Similar to the Poisson simulation, biclustering based on the profile log- 
likelihood criterion performs at least as well as the other methods and shows 
signs of convergence in both examples. These results provide further verifi- 
cation of the theoretical findings and support the use of biclustering based 
on the profile log-likelihood criterion. 
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